{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# \ud83d\udcc3 Solution for Exercise M7.01\n",
    "\n",
    "In this exercise we will define dummy classification baselines and use them as\n",
    "reference to assess the relative predictive performance of a given model of\n",
    "interest.\n",
    "\n",
    "We illustrate those baselines with the help of the Adult Census dataset, using\n",
    "only the numerical features for the sake of simplicity."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "\n",
    "adult_census = pd.read_csv(\"../datasets/adult-census-numeric-all.csv\")\n",
    "data, target = adult_census.drop(columns=\"class\"), adult_census[\"class\"]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "First, define a `ShuffleSplit` cross-validation strategy taking half of the\n",
    "samples as a testing at each round. Let us use 10 cross-validation rounds."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# solution\n",
    "from sklearn.model_selection import ShuffleSplit\n",
    "\n",
    "cv = ShuffleSplit(n_splits=10, test_size=0.5, random_state=0)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Next, create a machine learning pipeline composed of a transformer to\n",
    "standardize the data followed by a logistic regression classifier."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# solution\n",
    "from sklearn.pipeline import make_pipeline\n",
    "from sklearn.preprocessing import StandardScaler\n",
    "from sklearn.linear_model import LogisticRegression\n",
    "\n",
    "classifier = make_pipeline(StandardScaler(), LogisticRegression())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Compute the cross-validation (test) scores for the classifier on this dataset.\n",
    "Store the results pandas Series as we did in the previous notebook."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# solution\n",
    "from sklearn.model_selection import cross_validate\n",
    "\n",
    "cv_results_logistic_regression = cross_validate(\n",
    "    classifier, data, target, cv=cv, n_jobs=2\n",
    ")\n",
    "\n",
    "test_score_logistic_regression = pd.Series(\n",
    "    cv_results_logistic_regression[\"test_score\"], name=\"Logistic Regression\"\n",
    ")\n",
    "test_score_logistic_regression"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now, compute the cross-validation scores of a dummy classifier that constantly\n",
    "predicts the most frequent class observed the training set. Please refer to\n",
    "the online documentation for the [sklearn.dummy.DummyClassifier\n",
    "](https://scikit-learn.org/stable/modules/generated/sklearn.dummy.DummyClassifier.html)\n",
    "class.\n",
    "\n",
    "Store the results in a second pandas Series."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# solution\n",
    "from sklearn.dummy import DummyClassifier\n",
    "\n",
    "most_frequent_classifier = DummyClassifier(strategy=\"most_frequent\")\n",
    "cv_results_most_frequent = cross_validate(\n",
    "    most_frequent_classifier, data, target, cv=cv, n_jobs=2\n",
    ")\n",
    "test_score_most_frequent = pd.Series(\n",
    "    cv_results_most_frequent[\"test_score\"],\n",
    "    name=\"Most frequent class predictor\",\n",
    ")\n",
    "test_score_most_frequent"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now that we collected the results from the baseline and the model, concatenate\n",
    "the test scores as columns a single pandas dataframe."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# solution\n",
    "all_test_scores = pd.concat(\n",
    "    [test_score_logistic_regression, test_score_most_frequent],\n",
    "    axis=\"columns\",\n",
    ")\n",
    "all_test_scores"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "\n",
    "Next, plot the histogram of the cross-validation test scores for both models\n",
    "with the help of [pandas built-in plotting\n",
    "function](https://pandas.pydata.org/pandas-docs/stable/user_guide/visualization.html#histograms).\n",
    "\n",
    "What conclusions do you draw from the results?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# solution\n",
    "import numpy as np\n",
    "import matplotlib.pyplot as plt\n",
    "\n",
    "bins = np.linspace(start=0.5, stop=1.0, num=100)\n",
    "all_test_scores.plot.hist(bins=bins, edgecolor=\"black\")\n",
    "plt.legend(bbox_to_anchor=(1.05, 0.8), loc=\"upper left\")\n",
    "plt.xlabel(\"Accuracy (%)\")\n",
    "_ = plt.title(\"Distribution of the CV scores\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "tags": [
     "solution"
    ]
   },
   "source": [
    "We observe that the two histograms are well separated. Therefore the dummy\n",
    "classifier with the strategy `most_frequent` has a much lower accuracy than\n",
    "the logistic regression classifier. We conclude that the logistic regression\n",
    "model can successfully find predictive information in the input features to\n",
    "improve upon the baseline."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Change the `strategy` of the dummy classifier to `\"stratified\"`, compute the\n",
    "results. Similarly compute scores for `strategy=\"uniform\"` and then the  plot\n",
    "the distribution together with the other results.\n",
    "\n",
    "Are those new baselines better than the previous one? Why is this the case?\n",
    "\n",
    "Please refer to the scikit-learn documentation on\n",
    "[sklearn.dummy.DummyClassifier](\n",
    "https://scikit-learn.org/stable/modules/generated/sklearn.dummy.DummyClassifier.html)\n",
    "to find out about the meaning of the `\"stratified\"` and `\"uniform\"`\n",
    "strategies."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# solution\n",
    "stratified_dummy = DummyClassifier(strategy=\"stratified\")\n",
    "cv_results_stratified = cross_validate(\n",
    "    stratified_dummy, data, target, cv=cv, n_jobs=2\n",
    ")\n",
    "test_score_dummy_stratified = pd.Series(\n",
    "    cv_results_stratified[\"test_score\"], name=\"Stratified class predictor\"\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "tags": [
     "solution"
    ]
   },
   "outputs": [],
   "source": [
    "uniform_dummy = DummyClassifier(strategy=\"uniform\")\n",
    "cv_results_uniform = cross_validate(\n",
    "    uniform_dummy, data, target, cv=cv, n_jobs=2\n",
    ")\n",
    "test_score_dummy_uniform = pd.Series(\n",
    "    cv_results_uniform[\"test_score\"], name=\"Uniform class predictor\"\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "tags": [
     "solution"
    ]
   },
   "outputs": [],
   "source": [
    "all_test_scores = pd.concat(\n",
    "    [\n",
    "        test_score_logistic_regression,\n",
    "        test_score_most_frequent,\n",
    "        test_score_dummy_stratified,\n",
    "        test_score_dummy_uniform,\n",
    "    ],\n",
    "    axis=\"columns\",\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "tags": [
     "solution"
    ]
   },
   "outputs": [],
   "source": [
    "all_test_scores.plot.hist(bins=bins, edgecolor=\"black\")\n",
    "plt.legend(bbox_to_anchor=(1.05, 0.8), loc=\"upper left\")\n",
    "plt.xlabel(\"Accuracy (%)\")\n",
    "_ = plt.title(\"Distribution of the test scores\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "tags": [
     "solution"
    ]
   },
   "source": [
    "We see that using `strategy=\"stratified\"`, the results are much worse than\n",
    "with the `most_frequent` strategy. Since the classes are imbalanced,\n",
    "predicting the most frequent involves that we will be right for the proportion\n",
    "of this class (~75% of the samples). However, the `\"stratified\"` strategy will\n",
    "randomly generate predictions by respecting the training set's class\n",
    "distribution, resulting in some wrong predictions even for the most frequent\n",
    "class, hence we obtain a lower accuracy.\n",
    "\n",
    "This is even more so for the `strategy=\"uniform\"`: this strategy assigns class\n",
    "labels uniformly at random. Therefore, on a binary classification problem, the\n",
    "cross-validation accuracy is 50% on average, which is the weakest of the three\n",
    "dummy baselines."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "tags": [
     "solution"
    ]
   },
   "source": [
    "Note: one could argue that the `\"uniform\"` or `strategy=\"stratified\"`\n",
    "strategies are both valid ways to define a \"chance level\" baseline accuracy\n",
    "for this classification problem, because they make predictions \"by chance\".\n",
    "\n",
    "Another way to define a chance level would be to use the\n",
    "[sklearn.model_selection.permutation_test_score](https://scikit-learn.org/stable/auto_examples/model_selection/plot_permutation_tests_for_classification.html)\n",
    "utility of scikit-learn. Instead of using a dummy classifier, this function\n",
    "compares the cross-validation accuracy of a model of interest to the\n",
    "cross-validation accuracy of this same model but trained on randomly permuted\n",
    "class labels. The `permutation_test_score` therefore defines a chance level\n",
    "that depends on the choice of the class and hyper-parameters of the estimator\n",
    "of interest. When training on such randomly permuted labels, many machine\n",
    "learning estimators would end up approximately behaving much like the\n",
    "`DummyClassifier(strategy=\"most_frequent\")` by always predicting the majority\n",
    "class, irrespective of the input features. As a result, this `\"most_frequent\"`\n",
    "baseline is sometimes called the \"chance level\" for imbalanced classification\n",
    "problems, even though its predictions are completely deterministic and do not\n",
    "involve much \"chance\" anymore.\n",
    "\n",
    "Defining the chance level using `permutation_test_score` is quite\n",
    "computation-intensive because it requires fitting many non-dummy models on\n",
    "random permutations of the data. Using dummy classifiers as baselines is often\n",
    "enough for practical purposes. For imbalanced classification problems, the\n",
    "`\"most_frequent\"` strategy is the strongest of the three baselines and\n",
    "therefore the one we should use."
   ]
  }
 ],
 "metadata": {
  "jupytext": {
   "main_language": "python"
  },
  "kernelspec": {
   "display_name": "Python 3",
   "name": "python3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}